In [90]:
# Your code here feel free to use multiple cells and include markdown, graphs, latex or equations as you see fit.
import pandas as pd 
import plotly.offline as py 
import plotly.graph_objs as go
import numpy as np
import cufflinks as cf
import plotly.figure_factory as ff
cf.go_offline()
py.init_notebook_mode(connected=True)

import plotly.tools as tls
import matplotlib.pyplot as plt
import seaborn as sns

All date data in the sample spans roughly two months, mostly January. (The bar graph shows only dates of first trip, which had the widest range of dates.)

Examining Duration between date-metrics

In [108]:
#examine different Time Delta's and create duration data
data['dur_signup-bgc'] = data['bgc_date']-data['signup_date']
data['dur_bgc-vadd'] = data['vehicle_added_date']- data['bgc_date']
data['dur_signup-vadd'] = data['vehicle_added_date']- data['signup_date']
data['dur_bgc-trip'] = data['first_completed_date']- data['bgc_date']
data['dur_vadd-trip'] = data['first_completed_date']- data['vehicle_added_date']
data['dur_signup-trip'] = data['first_completed_date']- data['signup_date']
data[['dur_signup-bgc','dur_signup-vadd','dur_bgc-vadd','dur_bgc-trip','dur_vadd-trip','dur_signup-trip']].describe(include='all')
Out[108]:
| | dur_signup-bgc | dur_signup-vadd | dur_bgc-vadd | dur_bgc-trip | dur_vadd-trip | dur_signup-trip |
|---|---|---|---|---|---|---|
| count | 32896 | 13399 | 13144 | 6137 | 6137 | 6137 |
| mean | 10 days 00:23:27 | 15 days 00:04:37 | 7 days 00:47:06 | 8 days 20:40:19 | 6 days 01:02:53 | 12 days 00:45:03 |
| std | 10 days 12:06:59 | 13 days 23:29:51 | 9 days 06:28:54 | 6 days 16:16:36 | 5 days 17:44:27 | 7 days 17:30:26 |
| min | 0 days | 0 days | -30 days | 0 days | 0 days | 0 days |
| 25% | 2 days | 4 days | 1 days | 4 days | 2 days | 6 days |
| 50% | 6 days | 10 days | 4 days | 7 days | 4 days | 11 days |
| 75% | 15 days | 24 days | 10 days | 13 days | 9 days | 17 days |
| max | 69 days | 72 days | 55 days | 30 days | 30 days | 30 days |

This table summarizes the count, mean, standard deviation, quartiles, and range of each duration.
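For plotting and modeling, the timedelta columns above are usually easier to work with as plain day counts via `.dt.days`. A minimal sketch on a hypothetical two-row frame (the dates are made up; the column names mirror the notebook's):

```python
import pandas as pd

# Hypothetical mini-sample mirroring the notebook's date columns
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2016-01-01", "2016-01-05"]),
    "bgc_date": pd.to_datetime(["2016-01-03", "2016-01-20"]),
})

# Same subtraction as above yields a timedelta64 column...
df["dur_signup-bgc"] = df["bgc_date"] - df["signup_date"]

# ...which .dt.days turns into plain integers, handy for histograms/models
days = df["dur_signup-bgc"].dt.days
print(days.tolist())  # → [2, 15]
```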

In [96]:
pie = pd.DataFrame(data.loc[data['first_trip_completed']==True]['city_name'].value_counts())
pie
Out[96]:
city_name
Strark 3239
Berton 2437
Wrouver 461

Most Signup Traffic is coming from Mobile

In [99]:
pie = pd.DataFrame(data.loc[data['first_trip_completed']==True]['signup_os'].value_counts())
colors = ['#FEBFB3', '#E1396C', '#96D38C', '#D0F9B1']
pied = [go.Pie(labels = pie.index, values = pie.signup_os)]
fig = go.Figure(data=pied, layout = go.Layout(title= 'Incoming Channels: Sign-Up OS'))
fig['data'].update({'text':pie.index,'textposition':'auto','marker':{'colors':colors}})
py.iplot(fig)

Preferred/popular makes and models among Uber Drivers

In [100]:
fig = tls.make_subplots(rows=2, cols=1,subplot_titles=('Top Makes Uber Drivers Prefer','Top Models Uber Drivers Prefer'))
fig1 = data.loc[data['first_trip_completed']==True]['vehicle_make'].value_counts()[:6].iplot(asFigure = True,title='Top Car Manufacturers Uber Drivers Add',kind='bar')
fig.append_trace(fig1['data'][0],1,1)
This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]

In [101]:
fig2 = data.loc[data['first_trip_completed']==True]['vehicle_model'].value_counts()[:6].iplot(asFigure = True, title='Top Models Uber Drivers Prefer to Add',kind='bar')
fig.append_trace(fig2['data'][0],2,1)
py.iplot(fig)

Data Cleaning

In [102]:
print('Out of place Values:')
print('SignUp' ,len(data[data['dur_signup-trip'].dt.days<0]))
print('BGC',len(data[data['dur_bgc-trip'].dt.days<0]))
print('vehicle' ,len(data[data['dur_vadd-trip'].dt.days<0]))
Out of place Values:
SignUp 0
BGC 33
vehicle 54

Cleaning Solutions:

Make sure there are no trips until after the background check is complete and the vehicle has been added.

  • Where that ordering is violated, preserve the trip data and reset the BGC or vehicle-added date to the midpoint of signup and first trip.
In [103]:
#Edit records claiming a first trip before the vehicle was added or the background check completed;
#the offending date is replaced with the midpoint of signup and first trip.
#loop to fill in missing values or edit the background-check date column
for i, row in data.loc[(data['signup_date'].notnull()&data['first_completed_date'].notnull()&data['bgc_date'].isnull())|\
                       (data['dur_bgc-trip'].dt.days<0) | (data['dur_signup-bgc'].dt.days<0),:].iterrows():
    a = row['signup_date']
    b = row['first_completed_date']
    c = a + (b - a)/2
    data.at[i, 'bgc_date'] = pd.to_datetime(c)
#loop to fill-in missing or edit the vehicle added date column
for i, row in data.loc[(data['signup_date'].notnull()&data['first_completed_date'].notnull()&data['vehicle_added_date'].isnull())|\
                       (data['dur_vadd-trip'].dt.days<0) | (data['dur_signup-vadd'].dt.days<0),:].iterrows():
    a = row['signup_date']
    b = row['first_completed_date']
    c = a + (b - a)/2
    data.at[i, 'vehicle_added_date'] = pd.to_datetime(c)
    
#Note: re-run the duration cell to refresh the duration columns,
#then re-run the validation cell above to confirm the dates are consistently ordered.
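The `iterrows` loops above work, but the same repair can be sketched vectorized with a boolean mask, which is the more idiomatic pandas style. The toy frame below is made up; the column names mirror the notebook's:

```python
import pandas as pd

# Toy frame: the second row's bgc_date falls after the first trip (out of order)
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "bgc_date": pd.to_datetime(["2016-01-05", "2016-01-30"]),
    "first_completed_date": pd.to_datetime(["2016-01-10", "2016-01-10"]),
})

# Rows needing repair: missing bgc_date, or bgc_date outside [signup, first trip]
bad = (df["bgc_date"].isna()
       | (df["bgc_date"] > df["first_completed_date"])
       | (df["bgc_date"] < df["signup_date"]))

# Midpoint of signup and first trip, computed for all rows at once
midpoint = df["signup_date"] + (df["first_completed_date"] - df["signup_date"]) / 2

df.loc[bad, "bgc_date"] = midpoint[bad]
print(df["bgc_date"].tolist())
```

The same mask-and-midpoint pattern applies unchanged to `vehicle_added_date`.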

Assumptions and Notes

Ordering Issues

  • Background checks and vehicle adds appear to have no fixed order in the process (they can occur concurrently or out of order).
  • Entries with the background check AFTER the vehicle add make up only ~2.1% of this sample, so while not typical, it seems reasonable to allow these ordering exceptions.

Spanning Problem

  • The sample appears to cover one month of signups and two months of first trips. Data spanning a wider time period (and more seasonality) would be better for model fitting, but other features should still help when building the model later.
In [104]:
bandv = data.loc[data['bgc_date'].notnull()&data['vehicle_added_date'].notnull()]
print('entries with background checks before vehicle added:',len(bandv.loc[data['bgc_date']<data['vehicle_added_date']]))
print('entries with background checks after vehicle added:', len(bandv.loc[data['bgc_date']>data['vehicle_added_date']]))
entries with background checks before vehicle added: 10337
entries with background checks after vehicle added: 280
In [105]:
# calculating conversion percentages
count= data.count()
perc= round(count/data.shape[0] *100,1)

Only 11.2% of Original Signups end up completing their first trip

In [106]:
titl = f'Conversion of Signup to First-Completed-Trip<br><b>{perc.iloc[5]}%->{perc.iloc[6]}%->{perc.iloc[10]}%</b><br>(% relative to total signups)'
fig = data[['signup_date','bgc_date','vehicle_added_date','first_completed_date']].count().iplot(kind='bar',title=titl,asFigure= True) 
fig['data'].update({'text':data[['signup_date','bgc_date','vehicle_added_date','first_completed_date']].count(),'textposition':'auto', 'opacity':0.8,'marker':{'color':'rgb(158,202,225)'}})
py.iplot(fig)

Friction between conversions. Why might that be?

The largest relative drop is at vehicle_added, possibly due to:

  • Failed inspections
  • Shared accounts/vehicles
  • Insufficient motivation to reach an inspection dealer
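The funnel can be quantified two ways: share of original signups surviving each stage, and step-to-step conversion (which pinpoints the biggest drop). The signup total below is an assumed stand-in; the other counts echo the duration table earlier:

```python
import pandas as pd

# Funnel counts in stage order; the signup total is a hypothetical stand-in,
# the later counts mirror the duration table's non-null counts
counts = pd.Series({"signup_date": 54681, "bgc_date": 32896,
                    "vehicle_added_date": 13399, "first_completed_date": 6137})

# View 1: percent of original signups surviving each stage
rel_total = (counts / counts.iloc[0] * 100).round(1)

# View 2: step-to-step conversion -- where is the biggest relative drop?
step = (counts / counts.shift(1) * 100).round(1)

print(rel_total.to_dict())
print(step.to_dict())
```

With these stand-in numbers the bgc→vehicle_added step is the weakest link, matching the bar chart's story.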

Notes:

  • Perhaps the duration between the background-check date and the vehicle-added date is not as important, since those two steps can be done concurrently, without a fixed order.
  • Time-to-completion in the two stages above is likely less of a factor than in the other stages. That said, getting the vehicle inspected and approved may be significant friction for some applicants.
In [111]:
#copy of cell from before; recalculate duration values to cleaned data
data['dur_signup-bgc'] = data['bgc_date']-data['signup_date']
data['dur_bgc-vadd'] = data['vehicle_added_date']- data['bgc_date']
data['dur_signup-vadd'] = data['vehicle_added_date']- data['signup_date']
data['dur_bgc-trip'] = data['first_completed_date']- data['bgc_date']
data['dur_vadd-trip'] = data['first_completed_date']- data['vehicle_added_date']
data['dur_signup-trip'] = data['first_completed_date']- data['signup_date']
In [113]:
fig = data['dur_signup-bgc'].dt.days.iplot(asFigure = True, kind='histogram', title = 'Days to Completion: Frequency of Successful Conversion by days<br>(Hover Mouse to compare values)')
fig['layout'].update({'barmode':'stack'})
t1 = go.Histogram(x = data['dur_signup-vadd'].dt.days, name = 'dur_signup-vehAdd')
t2 = go.Histogram(x= data['dur_signup-trip'].dt.days, name = 'dur_signup-trip')
t3 =  go.Histogram(x= data['dur_vadd-trip'].dt.days ,name = 'dur_vehAdd-trip')
t4 = go.Histogram(x= data['dur_bgc-trip'].dt.days ,name = 'dur_bgc-trip')
fig['data'].append(t1)
fig['data'].append(t2)
fig['data'].append(t3)
fig['data'].append(t4)
py.iplot(fig)

All duration data appears to have a right-skewed distribution. A clue: the sooner an applicant moves on to the next step of the process, the more likely they are to complete it, and therefore to reach a successful first trip.
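The right-skew claim can be checked numerically with `Series.skew()`; a positive value confirms the long right tail. A sketch on made-up duration day counts (the real input would be one of the `dur_*` columns via `.dt.days`):

```python
import pandas as pd

# Toy right-skewed sample of durations in days; a few slow applicants
# stretch the right tail, as in the histograms above
days = pd.Series([0, 1, 1, 2, 2, 3, 4, 6, 10, 30])

# Positive sample skewness == right-skewed distribution
print(days.skew() > 0)  # → True
```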

  • Build a statistical model to predict whether a driver that signed up will begin driving for Uber.
    • How did the model perform? Are there any caveats? How can Uber use your model to improve our product?
In [244]:
data[['dur_signup-bgc','dur_signup-vadd','dur_bgc-vadd','dur_bgc-trip','dur_vadd-trip','dur_signup-trip']].corr()
Out[244]:

Building a Statistical Model

to predict whether a new signup will begin their first drive with Uber

Logistic regression is probably the best fit here, evaluated with K-Fold cross-validation, which should give a less optimistic estimate of out-of-sample performance than a single train/test split.
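The K-Fold evaluation mentioned above can be sketched as follows. `make_classification` is only a placeholder for the real feature matrix (encoded city/OS/channel plus the `dur_*` day counts); the fold count and seed are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data standing in for the notebook's encoded signup features
X, y = make_classification(n_samples=500, n_features=6, random_state=5)

# 5-fold CV: every row is used for both training and validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Mean accuracy across folds is a steadier estimate than one split
print(round(scores.mean(), 3))
```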

In [0]:
# Your code here feel free to use multiple cells and include markdown, graphs, latex or equations as you see fit.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One-hot encode the categorical features; durations become day counts
# (missing durations, i.e. steps never reached, are filled with -1 as a flag)
X = pd.get_dummies(data[['city_name', 'signup_os', 'signup_channel']])
for col in ['dur_signup-bgc', 'dur_bgc-vadd', 'dur_vadd-trip']:
    X[col] = data[col].dt.days.fillna(-1)
y = data['first_trip_completed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
lm = LogisticRegression(max_iter=1000)
lm.fit(X_train, y_train)
y_train_pred = lm.predict(X_train)
y_pred = lm.predict(X_test)
print('test accuracy:', accuracy_score(y_test, y_pred))

Post-Model Notes

Overfit:

This model would likely perform well on brand-new data with characteristics similar to the sample it was trained on (mainly the same time span). For newer samples, however, there will likely be much more variance (seasonality, changes outside the sample window, etc.), so the model is effectively overfit to this period and suboptimal at predicting new data.

Solution:

With Uber's huge data warehouse, it would be interesting to train a low-bias model on far more data (for example, a neural network on over 100 million training examples); that would most likely produce a model much better at forecasting future data.

Present your analysis from above.

In [0]:
# Further code or markdowns
  • Optional: Build a model to forecast the number of new drivers we expect to start every week.
    • How would you validate a model like this? What other information would you use if you had access to all of Uber's data?
    • Feel free to include model results in your presentation and analysis.
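One way to validate a weekly forecaster like the one proposed above is rolling-origin (walk-forward) evaluation: always fit on the past, score on the next week. A sketch with made-up weekly new-driver counts and a naive last-value baseline (any real model would replace the baseline line):

```python
import pandas as pd

# Hypothetical weekly new-driver counts; real values would come from
# grouping first_completed_date into weekly buckets
weekly = pd.Series([120, 135, 150, 160, 170, 180, 190, 200],
                   index=pd.date_range("2016-01-04", periods=8, freq="W-MON"))

# Rolling origin: train window grows by one week each iteration
errors = []
for cut in range(4, len(weekly)):
    train, actual = weekly.iloc[:cut], weekly.iloc[cut]
    forecast = train.iloc[-1]          # naive last-value baseline
    errors.append(abs(forecast - actual))

print(sum(errors) / len(errors))       # mean absolute error of the baseline
```

A random train/test split would leak future weeks into training, so this time-ordered scheme is the appropriate validation here; with all of Uber's data, city-level signups, marketing spend, and seasonality features could sharpen the forecasts.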
In [0]:
#weekly aggregation (a neural network or other forecaster would then be fit on these counts)
df['Date'] = pd.to_datetime(df['Date']) - pd.to_timedelta(7, unit='d')
df = (df.groupby(['Name', pd.Grouper(key='Date', freq='W-MON')])['Quantity']
        .sum()
        .reset_index()
        .sort_values('Date'))